-
Notifications
You must be signed in to change notification settings - Fork 858
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the actual XRay remote sampler which polls rules and orders / app… #3343
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3343 +/- ##
===========================================
+ Coverage 0 18.46% +18.46%
- Complexity 0 721 +721
===========================================
Files 0 351 +351
Lines 0 9618 +9618
Branches 0 972 +972
===========================================
+ Hits 0 1776 +1776
- Misses 0 7657 +7657
- Partials 0 185 +185
Continue to review full report at Codecov.
|
return new AwsXrayRemoteSamplerBuilder(resource); | ||
} | ||
|
||
AwsXrayRemoteSampler( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: this is my defensive thinking, but as the constructor is package-protected and thus accessible, I'd validate constructor parameters here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think they're all validated in the builder
|
||
pollFuture = | ||
executor.scheduleAtFixedRate( | ||
this::getAndUpdateSampler, 0, pollingIntervalNanos, TimeUnit.NANOSECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that pooling interval should have a sanity check - that is if the value is outside of typical, expected values, a warning should be logged. For me it's easy to imagine sb setting interval to 1 (assuming it's one second) and degrading the performance with request each nanosecond ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like all the validation is down in the builder.
|
||
// In practice, X-Ray always returns a Default rule that matches all requests so it is a bug in | ||
// our code or X-Ray to reach here, fallback just in case. | ||
logger.log( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👌
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. In the case this happens, it will totally spam the logs at WARNING for every span. I know it shouldn't ever happen, but should we maybe not be so aggressive with the logging here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True that the fallback loses its meaning if it adds destructive spam on every single span, turned it to FINE (thought about throttling logger but will lose access to it in contrib repo 😿 )
Hm. If that's the case, I'm definitely starting to question if this repository is the right place to host the code. Can AWS publish a library with all this code in it that we can just depend on, rather than having it all live here (as well in wherever it lives for the old AWS java instrumentation)? |
I think code for otel components that interface with vendors generally go in an otel contrib repo. We've been treating the sdk-extensions as a sort of contrib but maybe we should move to java-contrib? Not sure if it makes a difference in the actual scheme of things though - if the issue is review load a better option (though not mutually exclusive) is probably to add @kubawach to approvers finally :) |
The issue isn't review load (but I'm also in favor of getting @kubawach added to approvers), but potential maintenance & support load over the long term. If you are the only person who is really able to maintain the code, and debug issues with its interaction with AWS services, then that feels like the core implementation details probably don't belong here. A simple AWS propagator is one thing, but when we're talking about maintaining several thousand lines of non-trivial code for an AWS-specific sampler implementation, I'm probably against it. |
…nto xray-remote-sampler
this.resource = resource; | ||
this.initialSampler = initialSampler; | ||
client = new XraySamplerClient(endpoint); | ||
executor = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
personally, I'd just assign this in the declaration above, since it's final and doesn't depend on params.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to like either all assigned as fields or all in constructor, otherwise it seems like an ugly duck field
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Looks good enough to merge, then move over to the contrib repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀
Speaking for myself (and my employer), we have a strong interest in being able to approximately count sampled spans. I am also confident we can do this counting correctly and with a specification, as discussed in open-telemetry/oteps#148. From this perspective, it's the "Priority", "FixedRate", and "ReservoirSize" that are involved in sampling probability. As discussed in OTEP 148 Samplers are capable of producing probabilities for fixed rate sampling, we'll just have to replace Leaky-bucket rate limiting with Varopt reservoir sampling.
I'd like to draw a non-blocking connection between the XRay sampler and the discussion about Views taking place in the Metrics SIG. Sampler configurations are like views on spans, and as long as we're able to count them, they are also views that can produce metrics. The big question ia whether this configuration syntax can become a standard for sampling in OTel's default SDKs, and if so--how is configuration is supplied?
@jmacd Thanks - I agree that any rate limiting can be made compatible with the need for counting spans by tweaking the algorithm, perhaps to Varopt. In the meantime I'll be starting with the same as Jaeger but will be happy to change in the future.
Configuration is basically just data, not that much unlike even spans - just the data that's sent is about configuring an app, not an event. I'd expect we define a proto and transmit it over gRPC just like OTLP, there doesn't seem to be a reason to break from convention. What's in the proto will be a finer nuance - I wouldn't expect to crib much from the X-Ray data model for sampling since it's not very intuitive / well designed I think. In particular, reasoning out the result of having both rate limiting and fixed rate possible at the same time is mind blowing. Adding attribute matching to the jaeger data model is probably a better start. |
…lies them.
This implements the core part of XRay remote sampling, which polls rules and applies a fixed rate based on matching rules. This is the most important part, but in followups I will also be adding support for XRay's "centralized reservoir" (where the server handles the reservoir quotas instead of being fixed per client, this is what the GetSamplingTargets RPC is used for). @jkwatson You aint seen complicated yet :P